Fix for unexpected socket closures and data leakage under heavy load #646

todddialpad · 2024-11-25T17:20:42Z

This addresses issue #645, as well as aiohttp/aiohappyeyeballs#93 and aiohttp/aiohappyeyeballs#112.

1st1 · 2024-11-25T20:12:01Z

uvloop/loop.pyx

+                sockfd = sock.detach()
                # libuv will make socket non-blocking
-                tr._open(sock.fileno())
+                tr._open(sockfd)


The approach looks correct -- but I'm wondering how vanilla asyncio handles the same thing?

I think vanilla asyncio has an easier problem in that it can just have Python sockets "all the way down" and let reference counting take care of cleanup, while here we need to manage the disconnect with libuv dealing in file descriptors. I suspect there is some error-handling path where a file descriptor is closed while the Python socket object remains alive and not detached, so when that object is finally closed, it messes up any new socket that happens to have the same file descriptor.
e.g. create socket s, call a loop method passing in an explicit socket, <bad error path which will end with sock.close()> overlapping with an .accept. I think the .accept never results in a Python socket object being created.

So, between the methods that accept sockets and the other methods that internally work directly with file descriptors, can there be a discrepancy?
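To make the ownership question concrete, here is a minimal sketch using only the standard library (it assumes a POSIX system, and the helper name `fd_is_open` is made up for illustration). It shows what the `sock.detach()` call in the diff changes: after detaching, the Python socket object no longer refers to the descriptor, so a later `close()` on that object cannot affect whoever owns the descriptor now.

```python
import os
import socket


def fd_is_open(fd):
    """Return True if the OS still considers this descriptor open (hypothetical helper)."""
    try:
        os.fstat(fd)
    except OSError:
        return False
    return True


# A connected pair stands in for a socket the caller passes to the loop.
left, right = socket.socketpair()

# detach() returns the descriptor and puts the Python object in a closed state
# without closing the descriptor itself.
fd = left.detach()
fileno_after_detach = left.fileno()  # the object no longer refers to any fd

# Closing the detached object does not touch the descriptor...
left.close()
open_after_python_close = fd_is_open(fd)

# ...so the new owner (libuv, in uvloop's case) is the only one that closes it.
os.close(fd)
open_after_owner_close = fd_is_open(fd)

right.close()
```

With `sock.fileno()` instead of `sock.detach()`, both the Python object and the code holding the raw descriptor would believe they are responsible for closing it.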

MarkusSintonen · 2024-11-26T08:14:03Z

@todddialpad very nice! Do you know does it help with the other issue #506 which seems to be also related to incorrect sharing of sockets etc?

Any possibility to add some test here?

todddialpad · 2024-11-26T10:58:16Z

@todddialpad very nice! Do you know does it help with the other issue #506 which seems to be also related to incorrect sharing of sockets etc?

Any possibility to add some test here?

I am trying to get a stable test. It is tricky because it is a race condition, if my guess is correct. I think it is a race if TLS negotiation during a call to loop.create_connection with an explicit socket is cancelled, and a subsequent incoming connection is accepted before the CancelledError is propagated. I think both libuv (or uvloop) first and aiohttp second close the underlying file descriptor.

So if this is the case, I don't think this will fix issue #506 , which could be a similar but different root cause.
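The double close suspected above can be illustrated without uvloop or aiohttp. The sketch below is not a reproducer of the bug in this PR; it only assumes a POSIX system, where a newly created descriptor usually gets the lowest free number, and uses made-up names (`stale`, `victim`, `demonstrate_fd_reuse`). `os.close()` stands in for the event loop closing the descriptor first, and the second `close()` stands in for the caller closing its socket object later.

```python
import os
import socket


def demonstrate_fd_reuse():
    """Show how a late close() on a stale socket object can hit a reused fd."""
    victim_broken = False

    stale = socket.socket()   # explicit socket handed to the event loop
    fd = stale.fileno()
    os.close(fd)              # stand-in: the loop closes the raw descriptor

    victim = socket.socket()  # stand-in: a newly accepted connection
    reused = victim.fileno() == fd

    if reused:
        stale.close()         # second close now lands on the victim's descriptor
        try:
            victim.getsockname()
        except OSError:
            victim_broken = True
        victim.detach()       # its descriptor is gone; forget it without closing
    else:
        stale.detach()        # descriptor already closed; just forget it
        victim.close()
    return reused, victim_broken


reused, victim_broken = demonstrate_fd_reuse()
```

In a real server the window between the two closes is where an incoming connection could be accepted, which would explain why the failure only shows up under load and with cancellations.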

MarkusSintonen · 2024-11-26T11:56:08Z

So if this is the case, I don't think this will fix issue #506 , which could be a similar but different root cause.

OK, I see. The linked issue was also concerning, as it looked like it was trying to write data into an incorrect socket. We also observed that error at around the same times that we observed response data leaking to incorrect requests. But we don't know whether that issue is actually related to the data leakage or is something else. (These RuntimeErrors don't happen with vanilla asyncio.)

todddialpad · 2024-12-02T21:09:00Z

@todddialpad very nice! Do you know does it help with the other issue #506 which seems to be also related to incorrect sharing of sockets etc?
Any possibility to add some test here?

I am trying to get a stable test. It is tricky because it is a race condition, if my guess is correct. I think it is a race if TLS negotiation during a call to loop.create_connection with an explicit socket is cancelled, and a subsequent incoming connection is accepted before the CancelledError is propagated. I think both libuv (or uvloop) first and aiohttp second close the underlying file descriptor.

So if this is the case, I don't think this will fix issue #506 , which could be a similar but different root cause.

I still haven't been able to isolate a standalone, self-contained test. The test environment in which I generated the same error we see in production involves 2 VMs with significant network latency between them. The first of the VMs is just a web server, the second is a web server that accepts requests, and then makes outgoing client requests (using aiohttp) to the first webserver with TLS and a short timeout (around 1 second).

With this setup, I quite reliably get a failure within 250 connections. When I run with this patch applied, I have never had a failure in 20,000 connections.

We have also run this in our production environment. When we first encountered this failure, we hit it within 1 hour of using aiohttp >= 3.10. Since running with this patch we have been running for 5 days with no failures.
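For readers who want to see the sequence of calls involved, here is a rough single-process sketch of the client side of that scenario, using only the standard library event loop. It is an assumption-laden stand-in, not the two-VM setup described above and not a reproducer: a listening socket that never accepts plays the slow TLS server, and `asyncio.wait_for` plays the short client timeout. The function name `cancelled_tls_connect` is made up.

```python
import asyncio
import socket
import ssl


async def cancelled_tls_connect(timeout=0.2):
    """Cancel loop.create_connection(sock=...) while the TLS handshake is stalled."""
    # A listener that never accepts: the kernel completes the TCP handshake
    # from its backlog, but nobody answers the TLS ClientHello.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen()

    loop = asyncio.get_running_loop()
    sock = socket.socket()
    sock.setblocking(False)
    await loop.sock_connect(sock, listener.getsockname())

    timed_out = False
    try:
        await asyncio.wait_for(
            loop.create_connection(
                asyncio.Protocol,
                sock=sock,  # explicit, already-connected socket
                ssl=ssl.create_default_context(),
                server_hostname="localhost",
            ),
            timeout,
        )
    except asyncio.TimeoutError:
        timed_out = True

    await asyncio.sleep(0.05)  # give the loop a chance to run cleanup callbacks
    fileno_after = sock.fileno()
    # Safe under stdlib asyncio, where the transport shares this same socket object.
    sock.close()
    listener.close()
    return timed_out, fileno_after


if __name__ == "__main__":
    print(asyncio.run(cancelled_tls_connect()))
```

The second return value reports whether the caller's socket object still holds a descriptor after the cancellation; what it shows may differ between event loops and versions, which is the discrepancy this thread is about.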

todddialpad · 2024-12-04T00:29:16Z

Is accepting this blocked on the tests that are failing? I don't think those failures are related to this change, as they are also failing for PR #644, which is solely a documentation change.

I looked at the test logs and I would guess that a dependency is causing the changed results. Related to this, I notice that in the failing tests, an alpha release of Cython 3.1 is being used (Using cached Cython-3.1.0a1-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata). Is this intentional?
In the last test run that passed, the release version was used (Using cached Cython-3.0.11-cp313-cp313-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata)

AntonArsentiev · 2025-01-27T15:25:58Z

Hello everyone. Am I right in thinking that this PR fixes the issue below?

"RuntimeError: File descriptor 2877 is used by transport <TCPTransport closed=False reading=True 0x55b8dc9baa90>"

AntonArsentiev · 2025-02-03T21:18:14Z

Hello everyone :)

Like many other users of this library, I would be happy for this fix to be implemented in one of the upcoming releases.
Could you give a rough idea of when to expect this fix?

Dreamsorcerer · 2025-02-16T20:18:43Z

@1st1 @fantix Is anyone available to merge and release this? We're getting asked to put workarounds into aiohttp to deal with this, would be nice to have this fixed here instead.

bdraco · 2025-03-31T23:25:18Z

We added a workaround for this issue in aio-libs/aiohttp#10464, but it's causing issues when used with the asyncio SelectorEventLoop (aio-libs/aiohttp#10617), so we will likely revert it and wait for this PR instead.

webknjaz · 2025-03-31T23:38:36Z

Hey @elprans @1st1, do you think you'll be able to spare a minute to get this in?

…10464 fixes #10617 alternative fix is MagicStack/uvloop#646

@top-oai

…10464 (#10656) Reverts #10464 While this change improved the situation for uvloop users, it caused a regression with `SelectorEventLoop` (issue #10617) The alternative fix is MagicStack/uvloop#646 (not merged at the time of this PR) issue #10617 appears to be very similar to python/cpython@d5aeccf If someone can come up with a working reproducer for #10617 we can revisit this. cc @top-oai Minimal implementation that shows on cancellation the socket is cleaned up without the explicit `close` #10617 (comment) so this should be unneeded unless I've missed something (very possible with all the moving parts here) ## Related issue number fixes #10617

@top-oai

…10464 (#10656) Reverts #10464 While this change improved the situation for uvloop users, it caused a regression with `SelectorEventLoop` (issue #10617) The alternative fix is MagicStack/uvloop#646 (not merged at the time of this PR) issue #10617 appears to be very similar to python/cpython@d5aeccf If someone can come up with a working reproducer for #10617 we can revisit this. cc @top-oai Minimal implementation that shows on cancellation the socket is cleaned up without the explicit `close` #10617 (comment) so this should be unneeded unless I've missed something (very possible with all the moving parts here) ## Related issue number fixes #10617 (cherry picked from commit 06db052)

@top-oai

…'s a failure in start_connection() #10464 (#10657) **This is a backport of PR #10656 as merged into master (06db052).** Reverts #10464 While this change improved the situation for uvloop users, it caused a regression with `SelectorEventLoop` (issue #10617) The alternative fix is MagicStack/uvloop#646 (not merged at the time of this PR) issue #10617 appears to be very similar to python/cpython@d5aeccf If someone can come up with a working reproducer for #10617 we can revisit this. cc @top-oai Minimal implementation that shows on cancellation the socket is cleaned up without the explicit `close` #10617 (comment) so this should be unneeded unless I've missed something (very possible with all the moving parts here) ## Related issue number fixes #10617 Co-authored-by: J. Nick Koston <nick@koston.org>

1st1 · 2025-04-16T19:50:25Z

Hey @elprans @1st1, do you think you'll be able to spare a minute to get this in?

Sorry, @fantix and I will be going through this PR and others this week.

jperezr21 · 2025-04-24T20:19:09Z

Hey! I noticed that aiohttp 3.11.14 has been yanked. For those of us using uvloop and aiohttp and running into the File descriptor 91 is used by transport error, do you happen to know if there’s a temporary workaround or a specific combination of versions we can pin to in the meantime? Totally understand if we need to wait for this to be merged, just trying to keep things running smoothly in the short term. Thanks a lot!

Dreamsorcerer · 2025-04-24T21:20:37Z

You can pin to a yanked version.

jugalshah291 · 2025-05-08T20:59:23Z

Hi folks,
We are also running into the issue below:

File descriptor 91 is used by transport

I wonder if there is a fix.

Our setup:

aiohappyeyeballs==2.6.1
aiohttp==3.11.18
aiohttp-cors==0.8.1
uvloop==0.21.0

jperezr21 · 2025-05-08T21:38:06Z

Pinning to aiohttp==3.11.14 solved it for me

bdraco · 2025-05-08T23:44:07Z

Pinning to aiohttp==3.11.14 solved it for me

Just a heads-up: if you're using the default asyncio event loop (typically SelectorEventLoop), pinning to aiohttp==3.11.14 may introduce other issues due to some side effects in that version. If you're using uvloop exclusively, it's likely fine.

Ideally, we were hoping this PR would be merged to avoid relying on workarounds in aiohttp as we've already been down that road, had to revert, and don’t want a repeat. Unfortunately, this PR seems to have stalled.

jugalshah291 · 2025-05-09T17:47:21Z

Can we please prioritize this PR? It seems to be impacting many users.

todddialpad · 2025-05-09T23:20:52Z

@fantix thanks for having a look at this. I am not sure what to make of the test failures. The failures seem to be all related to Unix transports and subprocess transports. The PR only should affect TCP transports. I'm trying to repro. My dev environment is Ubuntu / py3.12, and the tests that are failing here are passing there. For example:

test_process_send_signal_1 (test_process.Test_UV_Process.test_process_send_signal_1) ... ok
test_process_streams_basic_1 (test_process.Test_UV_Process.test_process_streams_basic_1) ... ok
test_process_streams_devnull (test_process.Test_UV_Process.test_process_streams_devnull) ... ok
test_process_streams_pass_fds (test_process.Test_UV_Process.test_process_streams_pass_fds) ... ok

Do you have any ideas on how to proceed?

fantix · 2025-05-09T23:23:34Z

They are breaking in the debug build, maybe try this:

uvloop/.github/workflows/tests.yml

Lines 68 to 71 in 96b7ed3

    
      - name: Test (debug build)
        if: steps.release.outputs.version == 0
        run: |
          make distclean && make debug && make test

todddialpad · 2025-05-09T23:59:36Z

Tests / test (3.12, macos-latest) (pull_request)

Yes, I get the failures with the debug build, good eye. Thanks.

I have instrumented the changed code, and in a failing test, the modifications never even run (which makes sense since the test isn't creating any TCP connections).

I have built without this patch, and still see the failures with the debug build.

git clone --recursive https://github.com/magicstack/uvloop.git uvloop.official
cd uvloop.official/
python3 -m venv uvloop-dev
source uvloop-dev/bin/activate
pip install -e .[dev]
pip install psutil
make debug
make test

======================================================================
FAIL: test_process_streams_pass_fds (test_process.Test_UV_Process.test_process_streams_pass_fds) [Alive handle after test] (handle_name='UVProcessTransport')
----------------------------------------------------------------------
Traceback (most recent call last):
  File "uvloop.official/uvloop/_testbase.py", line 142, in tearDown
    self.assertEqual(
AssertionError: 1 != 0 : alive UVProcessTransport after test

So, could an upstream dependency have broken the debug build?

todddialpad · 2025-05-10T00:31:00Z

So, could an upstream dependency have broken the debug build?

Since the last successful test run, the following upstream dependencies have changed:

Cython-3.1.0 (was 3.0.12)
aiohttp-3.11.18 (was 3.11.16)
frozenlist-1.6.0 (was 1.5.0)
mypy_extensions-1.1.0 (was 1.0.0)
setuptools-80.3.1 (was 78.1.0)

I rebuilt using Cython-3.0.12 and the tests passed.

Would a manual execution of the tests on the main branch still pass (assuming it will grab Cython 3.1.0)?
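The comparison above can be done mechanically when the two CI logs are long. This is a small sketch with a made-up helper name (`changed_pins`); the version numbers are the ones listed in this comment, typed in by hand rather than parsed from the logs.

```python
def changed_pins(before, after):
    """Return {package: (old_version, new_version)} for every pin that differs."""
    return {
        name: (before.get(name), after[name])
        for name in sorted(after)
        if before.get(name) != after[name]
    }


# Versions from the last passing run and from the failing run, as listed above.
passing = {
    "Cython": "3.0.12",
    "aiohttp": "3.11.16",
    "frozenlist": "1.5.0",
    "mypy_extensions": "1.0.0",
    "setuptools": "78.1.0",
}
failing = {
    "Cython": "3.1.0",
    "aiohttp": "3.11.18",
    "frozenlist": "1.6.0",
    "mypy_extensions": "1.1.0",
    "setuptools": "80.3.1",
}

for name, (old, new) in changed_pins(passing, failing).items():
    print(f"{name}: {old} -> {new}")
```

Rebuilding with one candidate pinned back at a time, as was done with Cython here, then narrows the list to the dependency that matters.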

todddialpad · 2025-05-11T20:31:47Z

Would a manual execution of the tests on the main branch still pass (assuming it will grab Cython 3.1.0)?

I forked the main branch and tried running the tests. It fails with Cython 3.1.0. I pinned Cython to < 3.1.0 and the tests pass. I included this PR, and with the pinned Cython, all tests pass.

So I believe this PR could be merged. I created an issue for Cython 3.1.0 #677 .

jugalshah291 · 2025-05-26T16:13:07Z

Hi, checking back on this. Is there any ETA on when it will be merged?

GabrielSalla · 2025-06-01T01:06:09Z

Following this PR waiting for the fix

uvloop has a bug that prevents us from updating the libraries, so it is being deactivated for now. MagicStack/uvloop#646

jugalshah291 · 2025-06-16T15:58:29Z

Hi folks any ETA on this

jugalshah291 · 2025-07-30T21:41:39Z

Hi folks,
Can someone provide an update on this?
We are facing an issue where we suddenly start seeing File descriptor 91 is used by transport on a running process, and while the error happens the process is not able to serve traffic. This is impacting our service stability.

MarkusSintonen · 2025-07-31T07:41:52Z

We stopped using uvloop and didn't really observe any performance impact. It is probably better to stop using it until the issue is fixed, especially as we also observed information being leaked under heavy load (that issue is hard to reproduce locally).

yybdyybd · 2025-09-25T04:08:30Z

Hi Folks can someone provide an update on this We are facing an issue where we suddenly start seeing File descriptor 91 is used by transport on a running process and while the error happens the process is not able to serve traffic. This is impacting our service stability

Hello, I also have this problem. It occurs when calls to a third-party interface time out under high concurrency. Have you solved it? Please advise.

Dreamsorcerer · 2025-09-25T13:38:59Z

Have you solved it? Please advise.

Uninstall uvloop? Several users have reported that the performance difference is small today, so if it's breaking your application...

MarkusSintonen · 2025-09-25T17:04:12Z

Have you solved it?

Uninstall uvloop? Several users have reported that the performance difference is small today, so if it's breaking your application...

Yes, we uninstalled it and did not observe any change in performance. (We process a non-trivial number of requests, 30K+ RPS, on highly concurrent servers.)
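For anyone who wants to switch uvloop off without uninstalling it, one option is to choose the event loop at startup. The sketch below is only an illustration: the environment variable name `APP_USE_UVLOOP` and both function names are made up, and it falls back to the standard asyncio loop when uvloop is not installed.

```python
import asyncio
import os


def choose_loop_factory(env=None):
    """Return a callable that creates the event loop to use."""
    env = os.environ if env is None else env
    # Hypothetical switch: APP_USE_UVLOOP=0 forces the standard asyncio loop.
    if env.get("APP_USE_UVLOOP", "1") != "0":
        try:
            import uvloop
        except ImportError:
            pass
        else:
            return uvloop.new_event_loop
    return asyncio.new_event_loop


def run(coro_fn, env=None):
    """Run coro_fn() to completion on the chosen loop and return its result."""
    loop = choose_loop_factory(env)()
    try:
        return loop.run_until_complete(coro_fn())
    finally:
        loop.close()


if __name__ == "__main__":
    print(run(lambda: asyncio.sleep(0, result="ok")))
```

Running the same service with the switch on and off is also a cheap way to check the "no measurable performance difference" reports against your own workload.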

x0day · 2025-09-28T01:38:29Z

gently ping @fantix @1st1

> Briefly describe what this PR accomplishes and why it's needed. Our serve ingress keeps running into below error related to `uvloop` under heavy load ``` File descriptor 97 is used by transport ``` The uvloop team have a [PR](MagicStack/uvloop#646) to fix it, but seems like no one is working on it One of workaround mentioned in the ([PR](MagicStack/uvloop#646 (comment))) is to just turn off uvloop . We tried it in our env and didn't see any major performance difference Hence as part of this PR, we are defining a new env for controlling UVloop Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>

:(

commit 2f55d078bb69f39198eccf6293683e17a2e72dc5
Author: Goutam <goutam@anyscale.com>
Date: Wed Nov 12 16:37:24 2025 -0800

[Data] - Iceberg support upsert tables + schema update + overwrite tables (#58270)

- Support upserting Iceberg tables for IcebergDatasink
- Update schema on APPEND and UPSERT
- Enable overwriting the entire table

Upgrades to pyiceberg 0.10.0 because it now supports upsert and overwrite functionality. Also, for append, the library now handles the transaction logic implicitly, so that burden can be lifted from Ray Data.

---------

Signed-off-by: Goutam <goutam@anyscale.com>

commit d6793ecdbc4e6043cc0b0f19862b4b0c8256bb7f
Author: Joshua Lee <73967497+Sparks0219@users.noreply.github.com>
Date: Wed Nov 12 16:31:26 2025 -0800

[core] Use GetNodeAddressAndLiveness in raylet client pool (#58576)

Use GetNodeAddressAndLiveness in the raylet client pool instead of the bulkier Get; same for AsyncGetAll. It seems this was already done in the core worker client pool, so this makes the same change for the raylet client pool.
Signed-off-by: joshlee <joshlee@anyscale.com>

commit e713b3de319afd437f2de7435f5a2870167fa99a
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 15:01:35 2025 -0800

[doc] set default python env to 3.10 (#58570)

We no longer support building with Python 3.9.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

commit 8e4b32e0366a9b32f7dfbd55d5dd5a30fc5c734b
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Nov 12 15:01:20 2025 -0800

[bazel] rename constraint from hermatic to python_version (#58499)

The new name is more accurate.

This also moves the Python constraint definitions into the `bazel/` directory and registers the Python 3.10 platform with the hermetic toolchain.

This allows migrating from Python 3.9 to Python 3.10 incrementally.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

commit 0d56f3ef9ae32c5ce8543bb76d9ccde120140623
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Wed Nov 12 14:23:17 2025 -0800

[images][deps] raydepsets base extra depset (#58461)

Generates depsets for the base extra Python requirements and installs those requirements in the base extra image.

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

commit df65225e4f98bce2b45405b1cf89fb70556e2871
Author: Daniel Shin <88547237+kyuds@users.noreply.github.com>
Date: Thu Nov 13 07:08:15 2025 +0900

[Data] Use Approximate Quantile for RobustScaler Preprocessor (#58371)

Currently Ray Data has a preprocessor called `RobustScaler`. This scales the data based on given quantiles. Calculating the quantiles involves sorting the entire dataset by column for each column (C sorts for C columns), which is expensive for a large dataset.

** MAJOR EDIT **: had to replace the original `tdigest` with `ddsketch`, as I couldn't find well-maintained tdigest libraries for Python. ddsketch is better maintained.
** MAJOR EDIT 2 **: discussed offline to use `ApproximateQuantile` aggregator N/A N/A --------- Signed-off-by: kyuds <kyuseung1016@gmail.com> Signed-off-by: Daniel Shin <kyuseung1016@gmail.com> Co-authored-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com> commit 5e71d58badbfdcfc002826398c3e02469065cc71 Author: Sampan S Nayak <sampansnayak2@gmail.com> Date: Thu Nov 13 03:33:18 2025 +0530 [Core] support token auth in ray client server (#58557) support token auth in ray client server by using the existing grpc interceptors. This pr refactors the code to: - add/rename sync and async client and server interceptors - create grpc utils to house grpc channel and server creation logic, python codebase is updated to use these methods - separate tests for sync and async interceptors - make existing authentication integration tests to run with RAY_CLIENT mode --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com> Co-authored-by: sampan <sampan@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> commit a6cc5499e7fa07c0d6cdc7b7cd0b08dfc08073dd Author: Kunchen (David) Dai <54918178+Kunchd@users.noreply.github.com> Date: Wed Nov 12 13:45:02 2025 -0800 [Core] Move request id creation to worker to address plasma get perf regression (#58390) This PR address the performance regression introduced in the [PR to make ray.get thread safe](https://github.com/ray-project/ray/pull/57911). Specifically, the previous PR requires the worker to block and wait for AsyncGet to return with a reply of the request id needed for correctly cleaning up get requests. This additional synchronous step causes the plasma store Get to regress in performance. This PR moves the request id generation step to the plasma store, removing the blocking step to fix the perf regression. 
- [PR which introduced perf regression](https://github.com/ray-project/ray/pull/57911) - [PR which observed the regression](https://github.com/ray-project/ray/pull/58175) New performance of the change measured by `ray microbenchmark`. <img width="485" height="17" alt="image" src="https://github.com/user-attachments/assets/b96b9676-3735-4e94-9ade-aaeb7514f4d0" /> Original performance prior to the change. Here we focus on the regressing `single client get calls (Plasma Store)` metric, where our new performance returns us back to the original 10k per second range compared to the existing sub 5k per second. <img width="811" height="355" alt="image" src="https://github.com/user-attachments/assets/d1fecf82-708e-48c4-9879-34c59a5e056c" /> --------- Signed-off-by: davik <davik@anyscale.com> Co-authored-by: davik <davik@anyscale.com> commit 9e450e6805824ac825488e1455ac97f93df0bbc3 Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Date: Wed Nov 12 12:36:21 2025 -0800 [doc] symlink the doc dependency lock file (#58520) and ask people to use that lock file for building docs. Signed-off-by: Lonnie Liu <lonnie@anyscale.com> commit 16c2f5fffbd1d772606de28ac39c0bb7182efdd4 Author: Lehui Liu <lehui@anyscale.com> Date: Wed Nov 12 12:08:28 2025 -0800 [train] Set JAX_PLATFORMS env var based on ScalingConfig (#57783) 1. JaxTrainer relying on the runtime env var "JAX_PLATFORMS" to be set to initialize jax.distributed: https://github.com/ray-project/ray/blob/master/python/ray/train/v2/jax/config.py#L38 2. Before this change, user will have to configure both `use_tpu=True` in `ray.train.ScalingConfig` and passing `JAX_PLATFORMS=tpu` to be able to start jax.distributed. `JAX_PLATFORMS` can be comma separated string. 3. If user uses other jax.distributed libraries like Orbax, sometimes, it will leads to misleading error about distributed initialization. 4. After this change, if user sets `use_tpu=True`, we automatically add this to env var. 5. 
tpu unit test is not available this time, will explore for how to cover it later. --------- Signed-off-by: Lehui Liu <lehui@anyscale.com> commit 1ab16e26a0251d3964637c6fe0f2f9a0ae8c6312 Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com> Date: Wed Nov 12 12:04:16 2025 -0800 [Data] Add `Ranker` Interface (#58513) Creates a ranker interface that will rank the best operator to run next in `select_operator_to_run`. This code only refractors the existing code. The ranking value must be something that is comparable. None None --------- Signed-off-by: iamjustinhsu <jhsu@anyscale.com> commit 9d5a2416e2980501ffc5c094ce5c59709f93ccf2 Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Date: Wed Nov 12 11:50:42 2025 -0800 [bazel] upgrade bazel python rules to 0.25.0 (#58535) previously it was actually using 0.4.0, which is set up by the grpc repo. the declaration in the workspace file was being shadowed.. Signed-off-by: Lonnie Liu <lonnie@anyscale.com> commit 02afe68937429bfd6501e4d0f46780bca4dea329 Author: Balaji Veeramani <balaji@anyscale.com> Date: Wed Nov 12 11:34:59 2025 -0800 [Data] Refactor concurrency validation tests in `test_map.py` (#58549) The original `test_concurrency` function combined multiple test scenarios into a single test with complex control flow and expensive Ray cluster initialization. This refactoring extracts the parameter validation tests into focused, independent tests that are faster, clearer, and easier to maintain. Additionally, the original test included "validation" cases that tested valid concurrency parameters but didn't actually verify that concurrency was being limited correctly—they only checked that the output was correct, which isn't useful for validating the concurrency feature itself. 
**Key improvements:** - Split validation tests into `test_invalid_func_concurrency_raises` and `test_invalid_class_concurrency_raises` - Use parametrized tests for different invalid concurrency values - Switch from `shutdown_only` with explicit `ray.init()` to `ray_start_regular_shared` to eliminate cluster initialization overhead - Minimize test data from 10 blocks to 1 element since we're only validating parameter errors - Remove non-validation tests that didn't verify concurrency behavior N/A The validation tests now execute significantly faster and provide clearer failure messages. Each test has a single, well-defined purpose making maintenance and debugging easier. --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> commit 676b86f4a8d6a4c4eab70f5f381642d9a17fdca2 Author: Balaji Veeramani <balaji@anyscale.com> Date: Wed Nov 12 11:32:48 2025 -0800 [Data] Convert rST-style to Google-style docstrings in `ray.data` (#58523) This PR improves documentation consistency in the `python/ray/data` module by converting all remaining rST-style docstrings (`:param:`, `:return:`, etc.) to Google-style format (`Args:`, `Returns:`, etc.). 
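As a small illustration of the conversion described above, the sketch below shows the same docstring in both styles. The functions `scale_rst` and `scale_google` are hypothetical and do not come from `ray.data`; only the `:param:`/`:return:` and `Args:`/`Returns:` conventions are taken from the commit message.

```python
def scale_rst(value, factor=2.0):
    """Scale a number by a constant factor.

    :param value: The number to scale.
    :param factor: The multiplier to apply.
    :return: The scaled value.
    """
    return value * factor


def scale_google(value, factor=2.0):
    """Scale a number by a constant factor.

    Args:
        value: The number to scale.
        factor: The multiplier to apply.

    Returns:
        The scaled value.
    """
    return value * factor
```

Only the docstring layout differs; the behavior of the two functions is identical.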
**Files modified:**

- `python/ray/data/preprocessors/utils.py` - Converted `StatComputationPlan.add_callable_stat()`
- `python/ray/data/preprocessors/encoder.py` - Converted `unique_post_fn()`
- `python/ray/data/block.py` - Converted `BlockColumnAccessor.hash()` and `BlockColumnAccessor.is_composed_of_lists()`
- `python/ray/data/_internal/datasource/delta_sharing_datasource.py` - Converted `DeltaSharingDatasource.setup_delta_sharing_connections()`

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

commit 7e872837e450411e9da45acea0c52f4b67221500
Author: Nikhil G <nrghosh@users.noreply.github.com>
Date: Wed Nov 12 09:07:32 2025 -0800

[serve][llm] Fix ReplicaContext serialization error in DPRankAssigner (#58504)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>

commit cd09d104f6d595a805fd8f9979d9f81a828823b5
Author: Alexey Kudinkin <ak@anyscale.com>
Date: Wed Nov 12 11:50:05 2025 -0500

[Data] Lowering `DEFAULT_ACTOR_MAX_TASKS_IN_FLIGHT_TO_MAX_CONCURRENCY_FACTOR` to 2 (#58262)

This value was originally set to align with the previous default of 4. However, after some consideration I've realized that 4 is too high, so this lowers it to 2.
Signed-off-by: Alexey Kudinkin <ak@anyscale.com> commit 126a40bc711cf06ed44686ee5026624d6b78766e Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com> Date: Wed Nov 12 07:44:53 2025 -0800 [core] fix idle node termination on object pulling (#57928) Currently, a node is considered idle while pulling objects from the remote object store. This can lead to situations where a node is terminated as idle, causing the cluster to enter an infinite loop when pulling large objects that exceed the node idle termination timeout. This PR fixes the issue by treating object pulling as a busy activity. Note that nodes can still accept additional tasks while pulling objects (since pulling consumes no resources), but the auto-scaler will no longer terminate the node prematurely. Closes #54372 Test: - CI Signed-off-by: Cuong Nguyen <can@anyscale.com> commit ad8f30291137efce9e463fb23e6821f4c7c74a9c Author: Sagar Sumit <sagarsumit09@gmail.com> Date: Wed Nov 12 05:40:47 2025 -0800 [core] Use graceful shutdown path when actor OUT_OF_SCOPE (`del actor`) (#57090) When actors terminate gracefully, Ray calls the actor's `__ray_shutdown__()` method if defined, allowing for cleanup of resources. But, this is not invoked in case actor goes out of scope due to `del actor`. Traced through the entire code path, and here's what happens: Flow when `del actor` is called: 1. **Python side**: `ActorHandle.__del__()` -> `worker.core_worker.remove_actor_handle_reference(actor_id)` https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/python/ray/actor.py#L2040 2. **C++ ref counting**: `CoreWorker::RemoveActorHandleReference()` -> `reference_counter_->RemoveLocalReference()` - When ref count reaches 0, triggers `OnObjectOutOfScopeOrFreed` callback https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L2503-L2506 3. 
**Actor manager callback**: `MarkActorKilledOrOutOfScope()` -> `AsyncReportActorOutOfScope()` to GCS https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/actor_manager.cc#L180-L183 https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/task_submission/actor_task_submitter.cc#L44-L51 4. **GCS receives notification**: `HandleReportActorOutOfScope()` - **THE PROBLEM IS HERE** ([line 279 in `src/ray/gcs/gcs_actor_manager.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/gcs/gcs_actor_manager.cc#L279)): ```cpp DestroyActor(actor_id, GenActorOutOfScopeCause(actor), /*force_kill=*/true, // <-- HARDCODED TO TRUE! [reply, send_reply_callback]() { ``` 5. **Actor worker receives kill signal**: `HandleKillActor()` in [`src/ray/core_worker/core_worker.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L3970) ```cpp if (request.force_kill()) { // This is TRUE for OUT_OF_SCOPE ForceExit(...) // Skips __ray_shutdown__ } else { Exit(...) // Would call __ray_shutdown__ } ``` 6. **ForceExit path**: Bypasses graceful shutdown -> No `__ray_shutdown__` callback invoked. This PR simply changes the GCS to use graceful shutdown for OUT_OF_SCOPE actors. Also, updated the docs. 
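To make the difference between the two exit paths concrete, here is a toy model in plain Python. It is not Ray code: `ToyActor` and `handle_kill_actor` are hypothetical names, and the branch only mirrors the `force_kill` check quoted in step 5. The one name taken from the text above is the `__ray_shutdown__` hook.

```python
class ToyActor:
    """Stand-in for an actor that owns a resource needing cleanup."""

    def __init__(self):
        self.resource_open = True

    def __ray_shutdown__(self):
        # Cleanup hook; only the graceful path reaches it.
        self.resource_open = False


def handle_kill_actor(actor, force_kill):
    """Toy version of the force_kill branch (hypothetical helper)."""
    if force_kill:
        # Forced exit: the cleanup hook is skipped entirely.
        return "force_exit"
    # Graceful exit: run the cleanup hook if the actor defines one.
    shutdown = getattr(actor, "__ray_shutdown__", None)
    if callable(shutdown):
        shutdown()
    return "graceful_exit"


forced = ToyActor()
handle_kill_actor(forced, force_kill=True)    # resource stays open

graceful = ToyActor()
handle_kill_actor(graceful, force_kill=False)  # resource is released
```

In this model, switching the out-of-scope case from `force_kill=True` to `force_kill=False` is what lets the cleanup hook run, which is the behavior change the commit describes.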
--------- Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com> Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com> commit 15393edbe72f5079279d3a0e46b72adc7496cdfc Author: Sampan S Nayak <sampansnayak2@gmail.com> Date: Wed Nov 12 19:00:10 2025 +0530 [Core] use client interceptor for adding auth token in c++ client calls (#58424) - Use client interceptor for adding auth tokens in grpc calls when `AUTH_MODE=token` - BuildChannel() will automatically include the interceptor - Removed `auth_token` parameter from `ClientCallImpl` - removed manual auth from `python_gcs_subscriber`.cc - tests to verify auth works for autoscaller apis --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com> Co-authored-by: sampan <sampan@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> commit d496ea87808706333703be6ff25ecc9472330fd5 Author: Sampan S Nayak <sampansnayak2@gmail.com> Date: Wed Nov 12 11:25:11 2025 +0530 [core] Token auth usability improvements (#58408) - rename RAY_auth_mode → RAY_AUTH_MODE environment variable across codebase - Excluded healthcheck endpoints from authentication for Kubernetes compatibility - Fixed dashboard cookie handling to respect auth mode and clear stale tokens when switching clusters --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com> Co-authored-by: sampan <sampan@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> commit 584f5acdf804b1ba097ff7fa5d78a0bfd63c682b Author: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com> Date: Tue Nov 11 19:50:52 2025 -0800 [doc][serve][llm] Attached the correct figure to the pd docs (#58543) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com> commit a15f5be797ced0df321bfd8d42bab7d57defa2de Author: Lonnie Liu 
<95255098+aslonnie@users.noreply.github.com> Date: Tue Nov 11 18:00:43 2025 -0800 [doc] downgrade readthedocs to use python 3.10 (#58536) be consistent with the default build environment Signed-off-by: Lonnie Liu <lonnie@anyscale.com> commit 9dcb67dc9ff20d9b9ae29875bb610273ba4149ed Author: Dhyey Shah <dhyey2019@gmail.com> Date: Tue Nov 11 17:26:15 2025 -0800 [core] Fix auth test import (#58554) The python test step is failing on master now because of this. Probably a logical merge conflict. ``` FAILED: //python/ray/tests:test_grpc_authentication_server_interceptor (Summary) ... [2025-11-11T22:11:54Z] from ray.tests.authentication_test_utils import ( -- | [2025-11-11T22:11:54Z] ModuleNotFoundError: No module named 'ray.tests.authentication_test_utils' ``` Signed-off-by: dayshah <dhyey2019@gmail.com> commit 20bf68263beed3609e24aede3d9fc96bc07f0da0 Author: Dhyey Shah <dhyey2019@gmail.com> Date: Tue Nov 11 12:44:05 2025 -0800 [core][rdt] Abort NIXL and allow actor reuse on failed transfers (#56783) Signed-off-by: dayshah <dhyey2019@gmail.com> commit 89a329cd1e0219629132abc203085117a11949f3 Author: Dhyey Shah <dhyey2019@gmail.com> Date: Tue Nov 11 12:26:17 2025 -0800 [core] Improve kill actor logs (#58544) Signed-off-by: dayshah <dhyey2019@gmail.com> commit 6c9607ea57b9edde07c856f094835c84f47b79a6 Author: Nikhil G <nrghosh@users.noreply.github.com> Date: Tue Nov 11 12:16:41 2025 -0800 [docs][serve][llm] examples and doc for cross-node TP/PP in Serve (#57715) Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com> Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com> commit 711d9453828fecebb91b9642e799b4b0b4a493f7 Author: Dhyey Shah <dhyey2019@gmail.com> Date: Tue Nov 11 12:13:13 2025 -0800 [core] Make GlobalState lazy initialization thread-safe (#58182) Signed-off-by: dayshah <dhyey2019@gmail.com> commit fd10c39829a580bd83ba28c8518e7a7a5ebd3dfb Author: Kai-Hsun Chen <kaihsun@anyscale.com> Date: Tue Nov 11 09:43:05 2025 -0800 [core] Scheduling a detached actor with a 
placement group is not recommended (#57726)    If users schedule a detached actor into a placement group, Raylet will kill the actor when the placement group is removed. The actor will be stuck in the `RESTARTING` state forever if it's restartable until users explicitly kill it. In that case, if users try to `get_actor` with the actor's name, it can still return the restarting actor, but no process exists. It will no longer be restarted because the PG is gone, and no PG with the same ID will be created during the cluster's lifetime. The better behavior would be for Ray to transition a task/actor's state to dead when it is impossible to restart. However, this would add too much complexity to the core, so I think it's not worth it. Therefore, this PR adds a warning log, and users should use detached actors or PGs correctly. Example: Run the following script and run `ray list actors`. ```python import ray from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy from ray.util.placement_group import placement_group, remove_placement_group @ray.remote(num_cpus=1, lifetime="detached", max_restarts=-1) class Actor: pass ray.init() pg = placement_group([{"CPU": 1}]) ray.get(pg.ready()) actor = Actor.options( scheduling_strategy=PlacementGroupSchedulingStrategy( placement_group=pg, ) ).remote() ray.get(actor.__ray_ready__.remote()) ```  **Testing:** - [ ] Added/updated tests for my changes - [x] Tested the changes manually - [ ] This PR is not tested ❌ _(please explain why)_ **Code Quality:** - [x] Signed off every commit (`git commit -s`) - [x] Ran pre-commit hooks ([setup guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) **Documentation:** - [ ] Updated documentation (if applicable) ([contribution guide](https://docs.ray.io/en/latest/ray-contribute/docs.html)) - [ ] Added new APIs to `doc/source/` (if applicable)  --------- Signed-off-by: Kai-Hsun Chen <khchen@x.ai> Signed-off-by: Robert Nishihara 
<robertnishihara@gmail.com> Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org> Co-authored-by: Robert Nishihara <robertnishihara@gmail.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> commit 0752886e7d55694b6cf8d780b7470d58266c6a10 Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com> Date: Tue Nov 11 07:19:19 2025 -0800 [core] enable open telemetry by default (#56432) This PR enables open telemetry as the default backend for ray metric stack. The bulk of this PR is actually to fix tests that were written with some assumptions that no longer hold true. For ease of reviewing, I inline the reasons for the change together with the change for each tests in the comments. This PR also depends on a release of vllm (so that we can update the minimal supported version of vllm in ray). Test: - CI  --- > [!NOTE] > Enable OpenTelemetry metrics backend by default and refactor metrics/Serve tests to use timeseries APIs and updated `ray_serve_*` metric names. > > - **Core/Config**: > - Default-enable OpenTelemetry: set `RAY_enable_open_telemetry` to `true` in `ray_constants.py` and `ray_config_def.h`. > - Metrics `Counter`: use `CythonCount` by default; keep legacy `CythonSum` only when OTEL is explicitly disabled. > - **Serve/Metrics Tests**: > - Replace text scraping with `PrometheusTimeseries` and `fetch_prometheus_metric_timeseries` throughout. > - Update metric names/tags to `ray_serve_*` and counter suffixes `*_total`; adjust latency metric names and processing/queued gauges. > - Reduce ad-hoc HTTP scrapes; plumb a reusable `timeseries` object and pass through helpers. > - **General Test Fixes**: > - Remove OTEL parametrization/fixtures; simplify expectations where counters-as-gauges no longer apply; drop related tests. > - Cardinality tests: include `"low"` level and remove OTEL gating; stop injecting `enable_open_telemetry` in system config. 
> - Actor/state/thread tests: migrate to cluster fixtures, wait for dashboard agent, and adjust expected worker thread counts. > - **Build**: > - Remove OTEL-specific Bazel test shard/env overrides; clean OTEL env from C++ stats test. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 1d0190f3dd58d5f0c982fcbdab95fcf5f733553f. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup>  --------- Signed-off-by: Cuong Nguyen <can@anyscale.com> commit bf595e32d049503f5c1931c5b477647a06d191c2 Author: Sampan S Nayak <sampansnayak2@gmail.com> Date: Tue Nov 11 19:15:41 2025 +0530 [Core] move authentication_test_utils into ray._private to fix macos tests (#58528) the auth token test setup in `conftest.py` is breaking macos test. there are two test scripts (`test_microbenchmarks.py` and `test_basic.py`) that run after the wheel is installed but without editable mode. for these test to pass,` conftest.py` cannot import anything under `ray.tests`. this pr moves `authentication_test_utils` into `ray._private` to fix this issue Signed-off-by: sampan <sampan@anyscale.com> Co-authored-by: sampan <sampan@anyscale.com> commit 3d29c4ccc9182c44d3cfab08fb561cb7db74eea8 Author: Sampan S Nayak <sampansnayak2@gmail.com> Date: Tue Nov 11 19:10:56 2025 +0530 [Core] Add Service Interceptor to support token authentication in dashboard agent (#58405) Add a grpc service interceptor to intercept all dashboard agent rpc calls and validate the presence of auth token (when auth mode is token) --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: sampan <sampan@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com> commit 1a48e7318442d038f2c43d22da3b580fa643b8d1 Author: curiosity-hyf <curiooosity.h@gmail.com> Date: Tue Nov 11 21:35:42 2025 +0800 [Docs] fix pattern_async_actor demo typo (#58486) fix pattern_async_actor demo typo. 
Add `self.`. --------- Signed-off-by: curiosity-hyf <curiooosity.h@gmail.com> commit f2a7a94a75b007a801ee5a2cf6a6e24b93e9cb9a Author: Thomas Desrosiers <681004+thomasdesr@users.noreply.github.com> Date: Mon Nov 10 18:28:46 2025 -0800 Update pydoclint to version 0.8.1 (#58490) * Does the work to bump pydoclint up to the latest version * And allowlist any new violations it finds n/a n/a --------- Signed-off-by: Thomas Desrosiers <thomas@anyscale.com> commit 10983e8c9f50ddfa355efe7977d056b29b38d4c1 Author: Goutam <goutam@anyscale.com> Date: Mon Nov 10 17:34:13 2025 -0800 [Data] - Iceberg support predicate & projection pushdown (#58286) Predicate pushdown (https://github.com/ray-project/ray/pull/58150) in conjunction with this PR should speed up reads from Iceberg. Once the above change lands, we can add the pushdown interface support for IcebergDatasource --------- Signed-off-by: Goutam <goutam@anyscale.com> commit 09f01135f4ab71d52be7a44d06e40ff3767f6cee Author: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com> Date: Mon Nov 10 17:28:23 2025 -0800 [serve][llm] Fix import path in muli-node release test (#58498) Signed-off-by: Seiji Eicher <seiji@anyscale.com> commit 405c4648c2fe71afb7daf4ea574605190f129fd7 Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Date: Mon Nov 10 16:04:48 2025 -0800 [ci] upgrade rayci version (#58514) to 0.21.0; supports wanda priority now. Signed-off-by: Lonnie Liu <lonnie@anyscale.com> commit 6de012fd0df23993054653ca5517a66944c58dd2 Author: Zac Policzer <zac@anyscale.com> Date: Mon Nov 10 14:05:15 2025 -0800 [core] Add owned object spill metrics (#57870) This PR adds 2 new metrics to core_worker by way of the reference counter. The two new metrics keep track of the count and size of objects owned by the worker as well as keeping track of their states. 
States are defined as:

- **PendingCreation**: An object that is pending creation and hasn't finished its initialization (and is sizeless)
- **InPlasma**: An object which has an assigned node address and isn't spilled
- **Spilled**: An object which has an assigned node address and is spilled
- **InMemory**: An object which has no assigned address but isn't pending creation (and therefore must be local)

The approach used by these new metrics is to examine the state before and after any mutation of the reference in the reference_counter. This is required in order to do the appropriate bookkeeping (decrementing some values and incrementing others). Admittedly, a metric can be read between a decrement and its matching increment, depending on when the RecordMetrics loop runs. This unfortunate side effect seems preferable to adding mutual exclusion with metric collection, as this is potentially a high-throughput code path.

In addition, keeping live counts seemed preferable to doing a full accounting of the object store and all references at metric collection time. The reference counter may be tracking millions of objects, so each metric scan could be very expensive. Keeping running counts (despite being potentially inaccurate for short periods) seemed the right call.

This PR also allows the object size to change due to non-deterministic instantiation (say an object is initially created, but its primary copy dies, and then the recreation fails). This is an edge case, but seems important for completeness' sake.
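The before-and-after bookkeeping described above can be sketched in a few lines of Python. This is an illustrative model only, not the C++ reference counter: `Ref`, `classify`, and `OwnedObjectStats` are hypothetical names, and checking `pending_creation` before the node address is an assumption about how overlapping conditions are resolved.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Ref:
    """Hypothetical stand-in for one owned-object reference."""
    pending_creation: bool = True
    node_address: Optional[str] = None
    spilled: bool = False
    size: int = 0


def classify(ref):
    """Map a reference to one of the four states listed above."""
    if ref.pending_creation:
        return "PendingCreation"
    if ref.node_address is not None:
        return "Spilled" if ref.spilled else "InPlasma"
    return "InMemory"


class OwnedObjectStats:
    """Running per-state counts and sizes, updated on every mutation."""

    def __init__(self):
        self.count = Counter()
        self.bytes = Counter()

    def add(self, ref):
        state = classify(ref)
        self.count[state] += 1
        self.bytes[state] += ref.size

    def mutate(self, ref, **changes):
        # Snapshot state and size before the mutation...
        before_state, before_size = classify(ref), ref.size
        for name, value in changes.items():
            setattr(ref, name, value)
        # ...then move the object from its old bucket to its new one.
        after_state, after_size = classify(ref), ref.size
        self.count[before_state] -= 1
        self.bytes[before_state] -= before_size
        self.count[after_state] += 1
        self.bytes[after_state] += after_size
```

Because the size is also snapshotted before and after, a mutation that changes only `size` is accounted for the same way as a state transition. A reader that samples the counters between the decrement and the increment would see the transient undercount the commit message mentions.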
--------- Signed-off-by: zac <zac@anyscale.com> commit f2dd0e2b6dc7bc074f72197ff08f7d4e58635052 Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Date: Mon Nov 10 14:02:11 2025 -0800 [java] remove local genrule `//java:ray_java_pkg` (#58503) using `bazelisk run //java:gen_ray_java_pkg` everywhere Signed-off-by: Lonnie Liu <lonnie@anyscale.com> commit b23adc777c5b103291cf3a35b51b123a808d36f6 Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Date: Mon Nov 10 14:01:27 2025 -0800 [ci] apply isort to release test directory, part 1 (#58505) excluding `*_tests` directories for now to reduce the impact Signed-off-by: Lonnie Liu <lonnie@anyscale.com> commit ce1fd472b2677069a5bfcd2b5ed7a2695f5f2966 Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Date: Mon Nov 10 14:01:06 2025 -0800 [doc] change link check to run on python 3.12 (#58506) migrating all doc related things to run on python 3.12 Signed-off-by: Lonnie Liu <lonnie@anyscale.com> commit b09b076e15fefe842a0b7e33accff71ec3c31435 Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Date: Mon Nov 10 14:00:01 2025 -0800 [doc] ci: move doc annotation check to python 3.12 (#58507) be consistent with doc build environment Signed-off-by: Lonnie Liu <lonnie@anyscale.com> commit 8971f83ecb40d54729c2c26d394594c29199e19d Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com> Date: Mon Nov 10 12:52:43 2025 -0800 [data] Clear queue for manually mark_execution_finished operators (#58441) Currently, we clear _external_ queues when an operator is manually marked as finished. But we don't clear their _internal_ queues. 
This PR fixes that.

Fixes this test https://buildkite.com/ray-project/postmerge/builds/14223#019a5791-3d46-4ab8-9f97-e03ea1c04bb0/642-736

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

commit ffb51f866802ad3858d82a9356855a38503efec9
Author: Matthew Owen <mowen@anyscale.com>
Date: Mon Nov 10 10:54:34 2025 -0800

[data] Update depsets for multimodal inference release tests (#57233)

Update remaining multimodal release tests to use new depsets.

commit 62231dd4ba8e784da8800b248ad7616b8db92de7
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Nov 10 10:30:00 2025 -0800

[ci] seperate doc related jobs into its own group (#58454)

so that they are not called lints any more

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

commit 3f7a7b42fda0bb75a9af6e5ad197ba3743b011c2
Author: harshit-anyscale <harshit@anyscale.com>
Date: Mon Nov 10 23:45:38 2025 +0530

increase timeout for test_initial_replica tests (#58423)

- The `test_target_capacity` Windows test is failing, possibly because we set a short timeout of 10 seconds. Increasing it to verify whether the timeout is the issue.

Signed-off-by: harshit <harshit@anyscale.com>

commit 217031a48f4f83d04950ad39b94846ba362edd37
Author: Jugal Shah <47508441+jugalshah291@users.noreply.github.com>
Date: Mon Nov 10 09:39:43 2025 -0800

Define an env for controlling UVloop (#58442)

> Briefly describe what this PR accomplishes and why it's needed.

Our Serve ingress keeps running into the error below, related to `uvloop`, under heavy load:

```
File descriptor 97 is used by transport
```

The uvloop team has a [PR](https://github.com/MagicStack/uvloop/pull/646) to fix it, but it seems no one is working on it.

One workaround mentioned in the [PR](https://github.com/MagicStack/uvloop/pull/646#issuecomment-3138886982) is to turn off uvloop. We tried it in our environment and didn't see any major performance difference.

Hence, as part of this PR, we are defining a new environment variable for controlling uvloop.

Signed-off-by: jugalshah291 <shah.jugal291@gmail.com>

commit 2486ddd9fec83cc940937e3d91368942588ef177
Author: fscnick <6858627+fscnick@users.noreply.github.com>
Date: Mon Nov 10 23:29:03 2025 +0800

[Doc][KubeRay] eliminate vale errors (#58429)

Fix some Vale errors and suggestions in the kai-scheduler document.

See https://github.com/ray-project/ray/pull/58161#discussion_r2463701719

Signed-off-by: fscnick <fscnick.dev@gmail.com>

commit cb6a60d0afcfca87734a399291343e297031f1d5
Author: Daniel Sperber <github.blurry@9ox.net>
Date: Mon Nov 10 16:24:34 2025 +0100

[air] Add stacklevel option to deprecation_warning (#58357)

Currently, deprecation warnings are sometimes not informative enough. When the warning is triggered, it does not tell us *where* the deprecated feature is used. For example, Ray internally raises a deprecation warning when an `RLModuleConfig` is initialized.

```python
>>> from ray.rllib.core.rl_module.rl_module import RLModuleConfig
>>> RLModuleConfig()
2025-11-02 18:21:27,318 WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
```

This is confusing: where did *I* use a config, and what am I doing wrong? This raises issues like: https://discuss.ray.io/t/warning-deprecation-py-50-deprecationwarning-rlmodule-config-rlmoduleconfig-object-has-been-deprecated-use-rlmodule-observation-space-action-space-inference-only-model-config-catalog-class-instead/23064

Tracing where the warning actually originates is tedious - is it my code or internal? The output just shows `deprecation.py:50`. Not helpful.

This PR adds a stacklevel option with stacklevel=2 as the default to all `deprecation_warning`s.
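The effect of `stacklevel=2` can be seen with the standard library's `warnings` module. This is a minimal sketch and not Ray's `deprecation_warning` implementation; `deprecation_warning_sketch`, `build_legacy_thing`, and `reported_line` are hypothetical names used only for illustration.

```python
import warnings


def deprecation_warning_sketch(old: str, new: str, stacklevel: int = 2) -> None:
    # stacklevel=1 attributes the warning to this helper;
    # stacklevel=2 attributes it to whoever called the helper.
    message = f"`{old}` has been deprecated. Use `{new}` instead."
    warnings.warn(message, DeprecationWarning, stacklevel=stacklevel)


def build_legacy_thing(stacklevel: int = 2) -> None:
    deprecation_warning_sketch("build_legacy_thing", "build_thing", stacklevel=stacklevel)


def reported_line(stacklevel: int) -> int:
    """Return the line number the warning machinery attributes the warning to."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        build_legacy_thing(stacklevel=stacklevel)
    return caught[0].lineno


helper_line = reported_line(1)   # a line inside deprecation_warning_sketch
caller_line = reported_line(2)   # a line inside build_legacy_thing
```

With the default of 2, the reported location moves from the shared helper to the code that used the deprecated feature.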
So devs and users can better see where the deprecated option is actually used.

---

EDIT:

**Before**

```python
WARNING deprecation.py:50 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])`
```

**After** module.py:line where the deprecated artifact is used is shown in the log output:

When building an Algorithm:

```python
WARNING rl_module.py:445 -- DeprecationWarning: `RLModule(config=[RLModuleConfig object])` has been deprecated. Use `RLModule(observation_space=.., action_space=.., inference_only=.., model_config=.., catalog_class=..)` instead. This will raise an error in the future!
```

```python
.../ray/tune/logger/unified.py:53: RayDeprecationWarning: This API is deprecated and may be removed in future Ray releases. You could suppress this warning by setting env variable PYTHONWARNINGS="ignore::DeprecationWarning"
```

Signed-off-by: Daraan <github.blurry@9ox.net>

commit 5bff52ab5d9a9d67de88c4f0b86c918487ed7216
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Mon Nov 10 20:50:21 2025 +0530

[core] Configure an interceptor to pass auth token in python direct g… (#58395)

There are places in the Python code where we use the raw grpc library to make gRPC calls (e.g. pub-sub, some calls to GCS, etc.). In the long term we want to fully deprecate grpc library usage in our Python code base, but as that can take more effort and testing, in this PR I am introducing an interceptor to add auth headers (this will take effect for all gRPC calls made from Python).

```
export RAY_auth_mode="token"
export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
ray start --head
ray job submit -- echo "hi"
```

output

```
ray job submit -- echo "hi"
2025-11-04 06:28:09,122 - INFO - NumExpr defaulting to 4 threads.
Job submission server address: http://127.0.0.1:8265

-------------------------------------------------------
Job 'raysubmit_1EV8q86uKM24nHmH' submitted successfully
-------------------------------------------------------

Next steps
  Query the logs of the job:
    ray job logs raysubmit_1EV8q86uKM24nHmH
  Query the status of the job:
    ray job status raysubmit_1EV8q86uKM24nHmH
  Request the job to be stopped:
    ray job stop raysubmit_1EV8q86uKM24nHmH

Tailing logs until the job exits (disable with --no-wait):
2025-11-04 06:28:10,363 INFO job_manager.py:568 -- Runtime env is setting up.
hi
Running entrypoint for job raysubmit_1EV8q86uKM24nHmH: echo hi

------------------------------------------
Job 'raysubmit_1EV8q86uKM24nHmH' succeeded
------------------------------------------
```

dashboard

test.py

```python
import time

import ray
from ray._raylet import Config

ray.init()


@ray.remote
def print_hi():
    print("Hi")
    time.sleep(2)


@ray.remote
class SimpleActor:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value


actor = SimpleActor.remote()
result = ray.get(actor.increment.remote())

for i in range(100):
    ray.get(print_hi.remote())

time.sleep(20)
ray.shutdown()
```

```
export RAY_auth_mode="token"
export RAY_AUTH_TOKEN="abcdef1234567890abcdef123456789"
python test.py
```

<img width="1720" height="1073" alt="image" src="https://github.com/user-attachments/assets/008829d8-51b6-445a-b135-5f76b6ccf292" />

overview page

<img width="1720" height="1073" alt="image" src="https://github.com/user-attachments/assets/cece0da7-0edd-4438-9d60-776526b49762" />

job page: tasks are listed

<img width="1720" height="1073" alt="image" src="https://github.com/user-attachments/assets/b98eb1d9-cacc-45ea-b0e2-07ce8922202a" />

task page

<img width="1720" height="1073" alt="image" src="https://github.com/user-attachments/assets/09ff38e1-e151-4e34-8651-d206eb8b5136" />

actors page

<img width="1720" height="1073" alt="image" src="https://github.com/user-attachments/assets/10a30b3d-3f7e-4f3d-b669-962056579459" />

specific actor page

<img width="1720" height="1073" alt="image" src="https://github.com/user-attachments/assets/ab1915bd-3d1b-4813-8101-a219432a55c0" />

---------

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>

commit 71c7bd056cc132c57a4c3cf13d0f5207cbcfd73f
Author: Xinyu Zhang <60529799+xyuzh@users.noreply.github.com>
Date: Sun Nov 9 08:34:46 2025 -0800

[Data] Add exception handling for invalid URIs in download operation (#58464)

commit d74c1570543045a0f99df4d5690ac44f1fda4a55
Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Date: Sat Nov 8 15:35:11 2025 -0800

[dashboards][core] Make `do_reply` accept status_code, instead of success: bool (#58384)

Pass `status_code` directly into `do_reply`. This is a follow-up to https://github.com/ray-project/ray/pull/58255

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>

commit e793631896f65a88513510b4e7bf6f100607cb03
Author: Rueian <rueiancsie@gmail.com>
Date: Sat Nov 8 15:32:10 2025 -0800

[core][autoscaler] Fix RAY_NODE_TYPE_NAME handling when autoscaler is in read-only mode (#58460)

This ensures node type names are correctly reported even when the autoscaler is disabled (read-only mode).
Autoscaler v2 fails to report Prometheus metrics when operating in read-only mode on KubeRay with the following KeyError:

```
2025-11-08 12:06:57,402 ERROR autoscaler.py:215 -- 'small-group'
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/autoscaler.py", line 200, in update_autoscaling_state
    return Reconciler.reconcile(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 120, in reconcile
    Reconciler._step_next(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 275, in _step_next
    Reconciler._scale_cluster(
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/instance_manager/reconciler.py", line 1125, in _scale_cluster
    reply = scheduler.schedule(sched_request)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 933, in schedule
    ResourceDemandScheduler._enforce_max_workers_per_type(ctx)
  File "/home/ray/anaconda3/lib/python3.9/site-packages/ray/autoscaler/v2/scheduler.py", line 1006, in _enforce_max_workers_per_type
    node_config = ctx.get_node_type_configs()[node_type]
KeyError: 'small-group'
```

This happens because the `ReadOnlyProviderConfigReader` populates `ctx.get_node_type_configs()` using node IDs as node types, which is correct for local Ray (where local ray does not have `RAY_NODE_TYPE_NAME` set), but incorrect for KubeRay where `ray_node_type_name` is present and expected wi…

Fix for unexpected socket closures and data leakage under heavy load

752f7c1
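
In the diff under review, the loop calls `sock.detach()` and opens the transport on the returned descriptor instead of borrowing it with `sock.fileno()`. The ownership difference can be illustrated with the standard library alone. This is a sketch of the general idea, not uvloop code; `hand_off_descriptor` is a hypothetical helper name.

```python
import socket


def hand_off_descriptor(sock: socket.socket) -> int:
    """Transfer ownership of the OS descriptor away from the Python socket object."""
    # detach() returns the descriptor and leaves `sock` without one, so a later
    # sock.close() (or garbage collection of `sock`) cannot close the descriptor.
    return sock.detach()


left, right = socket.socketpair()
fd = hand_off_descriptor(left)

detached_fileno = left.fileno()  # -1 once the socket object is detached
left.close()                     # harmless: no descriptor is attached any more

# The descriptor is still open and usable by its new owner. Re-wrapping it in a
# fresh socket object here only serves to demonstrate that.
owner = socket.socket(fileno=fd)
owner.sendall(b"ping")
received = right.recv(4)

owner.close()
right.close()
```

With `fileno()` the Python object would keep ownership, so closing it later could close a descriptor number that the loop, or an unrelated newer socket that reused the number, still depends on.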

1st1 reviewed Nov 25, 2024

bdraco mentioned this pull request Jan 27, 2025

RuntimeError: File descriptor 2877 is used by transport <TCPTransport closed=False reading=True 0x55b8dc9baa90> aio-libs/aiohttp#10362

Closed

1 task

bdraco mentioned this pull request Feb 14, 2025

Close the socket if there's a failure in start_connection() aio-libs/aiohttp#10464

Merged

5 tasks

top-oai mentioned this pull request Feb 16, 2025

RuntimeError: File descriptor 2877 is used by transport <TCPTransport closed=False reading=True 0x55b8dc9baa90> #653

Open

bdraco mentioned this pull request Mar 12, 2025

OSError: [Errno 9] Bad file descriptor with uvloop aio-libs/aiohttp#10506

Open

1 task

bdraco mentioned this pull request Mar 31, 2025

BlockingIOError and File descriptor xx is used by transport aio-libs/aiohttp#10617

Closed

1 task

bdraco added a commit to aio-libs/aiohttp that referenced this pull request Apr 1, 2025

Revert: Close the socket if there's a failure in start_connection() #10464

1928761

fixes #10617 alternative fix is MagicStack/uvloop#646

bdraco mentioned this pull request Apr 1, 2025

Revert: Close the socket if there's a failure in start_connection() #10464 aio-libs/aiohttp#10656

Merged

patchback bot mentioned this pull request Apr 1, 2025

[PR #10656/06db052e backport][3.11] Revert: Close the socket if there's a failure in start_connection() #10464 aio-libs/aiohttp#10657

Merged

patchback bot mentioned this pull request Apr 1, 2025

[PR #10656/06db052e backport][3.12] Revert: Close the socket if there's a failure in start_connection() #10464 aio-libs/aiohttp#10658

Merged

Merge branch 'master' into master

1590a8e

bdraco mentioned this pull request May 26, 2025

aiohttp.ClientSession() does not send TCP FIN aio-libs/aiohttp#4685

Open

GabrielSalla added a commit to GabrielSalla/sentinela that referenced this pull request Jun 1, 2025

remove uvloop package

66946cb

Uvloop has a bug that is preventing from updating the libraries so it's being deactivated for now. MagicStack/uvloop#646

GabrielSalla added a commit to GabrielSalla/sentinela that referenced this pull request Jun 1, 2025

remove uvloop package

821fc81

Uvloop has a bug that is preventing from updating the libraries so it's being deactivated for now. MagicStack/uvloop#646

bdraco mentioned this pull request Sep 24, 2025

Workaround uvloop losing track of sockets when passing a socket to create_connection aio-libs/aiohttp#11539

Draft

5 tasks

jugalshah291 mentioned this pull request Nov 7, 2025

Define an env for controlling UVloop ray-project/ray#58442

Merged
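
The Ray change referenced above puts uvloop behind an environment variable so that it can be switched off as a workaround. A minimal sketch of that kind of toggle follows; the variable name `EXAMPLE_ENABLE_UVLOOP` and the function `pick_loop_factory` are hypothetical, since the actual name Ray uses is not given in this thread.

```python
import asyncio
import os

# Hypothetical variable name, for illustration only.
USE_UVLOOP_ENV = "EXAMPLE_ENABLE_UVLOOP"
_FALSY = {"0", "false", "no", "off"}


def pick_loop_factory(environ=None):
    """Return a zero-argument callable that creates an event loop."""
    environ = os.environ if environ is None else environ
    # Assumption: uvloop stays enabled unless the variable is set to a falsy value.
    if environ.get(USE_UVLOOP_ENV, "1").strip().lower() in _FALSY:
        return asyncio.new_event_loop
    try:
        import uvloop
    except ImportError:
        # uvloop is optional; fall back to the default asyncio loop.
        return asyncio.new_event_loop
    return uvloop.new_event_loop


# Usage: force the plain asyncio loop, run a trivial coroutine, and clean up.
factory = pick_loop_factory({USE_UVLOOP_ENV: "0"})
loop = factory()
try:
    result = loop.run_until_complete(asyncio.sleep(0, result="ok"))
finally:
    loop.close()
```

Returning a loop factory rather than installing a global policy keeps the choice local to the caller.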


todddialpad commented Nov 25, 2024

1st1 Nov 25, 2024

todddialpad Nov 28, 2024

MarkusSintonen commented Nov 26, 2024

todddialpad commented Nov 26, 2024

MarkusSintonen commented Nov 26, 2024

todddialpad commented Dec 2, 2024

todddialpad commented Dec 4, 2024

AntonArsentiev commented Jan 27, 2025

AntonArsentiev commented Feb 3, 2025

Dreamsorcerer commented Feb 16, 2025

bdraco commented Mar 31, 2025 • edited

webknjaz commented Mar 31, 2025

1st1 commented Apr 16, 2025

jperezr21 commented Apr 24, 2025

Dreamsorcerer commented Apr 24, 2025

jugalshah291 commented May 8, 2025

jperezr21 commented May 8, 2025

bdraco commented May 8, 2025

jugalshah291 commented May 9, 2025 • edited

todddialpad commented May 9, 2025

fantix commented May 9, 2025

todddialpad commented May 9, 2025

todddialpad commented May 10, 2025

todddialpad commented May 11, 2025

jugalshah291 commented May 26, 2025

GabrielSalla commented Jun 1, 2025

jugalshah291 commented Jun 16, 2025

jugalshah291 commented Jul 30, 2025

MarkusSintonen commented Jul 31, 2025

yybdyybd commented Sep 25, 2025

Dreamsorcerer commented Sep 25, 2025

MarkusSintonen commented Sep 25, 2025

x0day commented Sep 28, 2025
